ABSTRACT

Recently, the deep-belief-networks (DBN) based voice ac- 
tivity detection (VAD) has been proposed. It is powerful in 
fusing the advantages of multiple features, and achieves the 
state-of-the-art performance. However, the deep layers of the 
DBN-based VAD do not show an apparent superiority to the 
shallower layers. In this paper, we propose a denoising-deep- 
neural-network (DDNN) based VAD to address the aforemen- 
tioned problem. Specifically, we pre-train a deep neural net- 
work in a special unsupervised denoising greedy layer-wise 
mode, and then fine-tune the whole network in a supervised 
way by the common back-propagation algorithm. In the pre- 
training phase, we take the noisy speech signals as the visible 
layer and try to extract a new feature that minimizes the
reconstruction cross-entropy loss between the noisy speech
signals and their corresponding clean speech signals. Experimental
results show that the proposed DDNN-based VAD not only 
outperforms the DBN-based VAD but also shows an apparent 
performance improvement of the deep layers over shallower 
layers. 

Index Terms — Deep learning, denoising deep neural net- 
works, voice activity detection. 

1. INTRODUCTION 

Voice activity detectors (VADs) help to separate speech from
background noise. They are important front ends of modern
speech processing systems, such as speech recognition
systems [1-3] and speech communication systems [4]. Recently,
machine-learning-based VADs have received much attention
because of the following merits. First, they can be integrated
naturally into speech recognition systems. Second, they have
strong theoretical foundations that guarantee their performance.
Third, they can fuse the advantages of multiple features [5-9]
much better than traditional VADs.

The machine-learning-based VADs can be categorized into
four groups [10-16]. The first group is the
discriminative-weight-training-based VADs [10, 12, 16]. They form
linear weighted combinations of multiple features in the original
feature space. The second group is the support-vector-machine
(SVM) based VADs [11, 13]. They first fuse multiple
features into a long feature vector in the original feature space,
and then project the long feature vector to the kernel-induced
feature space for better classification performance. The third
group is the multiple-kernel-SVM (MK-SVM) based VAD
[14, 15]. It takes the distribution diversity of multiple features
into consideration by first projecting different features
into different kernel spaces and then fusing the features in the
kernel spaces by a linear weighted combination. All
of the aforementioned three groups utilize shallow models,
i.e. models with only zero or one hidden layer, which lack the
ability to describe highly variant features and to discover
the underlying manifold of the features.

The fourth group is the deep-belief-networks (DBN)
based VAD [17]. Fundamentally, because the DBN [18]
contains multiple hidden layers, the DBN-based VAD can
describe highly variant features; because the unsupervised
pre-training phase of the DBN provides an initial point that is
close to a good solution, the DBN-based VAD has a strong
generalization ability when compared with other
machine-learning-based VADs. However, the deep layers of the
DBN-based VAD do not yield an apparent superiority to the
shallower layers. In our opinion, it might not be proper
to simply consider VAD as a binary classification problem
with the noisy speech and the background noise as the
two classes, since the background noise also contributes to
the distribution of the noisy speech. This might account for
the lack of an apparent superiority of the deep layers over the
shallower layers in the DBN-based VAD.

In this paper, we propose a novel denoising-deep-neural-network
(DDNN) based VAD. The DDNN training also
consists of two phases. The first phase is a special unsupervised
denoising greedy layer-wise pre-training phase. The
pre-training process of each hidden layer tries to extract a
new feature that minimizes the reconstruction cross-entropy
loss between the noisy speech signals and their corresponding
clean speech signals (but not the noisy speech signals).
The second phase is the well-known supervised fine-tuning
phase. It groups all layers with the pre-trained parameters into
a whole deep neural network and tunes the parameters for the
minimum classification error. Experimental results show that
the proposed DDNN-based VAD not only outperforms the
DBN-based VAD but also shows an apparent performance
improvement of the deep layers over shallower layers.

2. DENOISING-DNN-BASED VAD 

The training process of the DDNN-based VAD consists of 
two phases - unsupervised denoising layer-wise pre-training 
phase and supervised fine-tuning phase, which are presented 
in detail in Sections 2.1 and 2.2 respectively. The overview of
the DDNN-based VAD is presented in Algorithm 1.

2.1. Unsupervised Denoising Layer-wise Pre-training 

Suppose we have $D$-dimensional noisy speech observations
(i.e. frames) $\{\mathbf{x}_i, y_i\}_{i=1}^{n}$ and their corresponding
clean speech observations $\{\hat{\mathbf{x}}_i, y_i\}_{i=1}^{n}$ with
$\mathbf{x}_i = [x_{i,d}]_{d=1}^{D}$, $y_i \in \{H_0, H_1\}$, where
$x_{i,d} \in [0,1]$ and $H_1/H_0$ denote the speech/noise hypothesis.

The layer-wise pre-training of each module of the DDNN
consists of optimizing two activation functions jointly. The
first function, denoted as $f_{\theta}(\cdot)$, maps the noisy speech
observation from the visible layer $\mathbf{x}$ to a hidden layer
$f_{\theta}(\mathbf{x})$. The second function, denoted as
$g_{\theta'}(\cdot)$, tries to reconstruct $\hat{\mathbf{x}}$ (but
not $\mathbf{x}$) from the hidden layer by
$g_{\theta'}(f_{\theta}(\mathbf{x}))$.

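The unsupervised pre-training tries to minimize the
reconstruction cross-entropy loss between $\{\mathbf{x}_i\}_{i=1}^{n}$
and $\{\hat{\mathbf{x}}_i\}_{i=1}^{n}$, which is defined as follows

$$J_{\theta,\theta'}(\mathbf{x};\hat{\mathbf{x}}) = \min_{\theta,\theta'} \sum_{i=1}^{n} L\left(\hat{\mathbf{x}}_i;\, g_{\theta'}\left(f_{\theta}(\mathbf{x}_i)\right)\right) \tag{1}$$

with $L(\hat{\mathbf{x}}_i;\mathbf{z}_i)$ defined as

$$L(\hat{\mathbf{x}}_i;\mathbf{z}_i) = -\sum_{d=1}^{D}\left(\hat{x}_{i,d}\log z_{i,d} + (1-\hat{x}_{i,d})\log(1-z_{i,d})\right)$$

where $\mathbf{z}_i$ is short for $g_{\theta'}(f_{\theta}(\mathbf{x}_i))$.
Problem (1) can be solved locally by the well-known
back-propagation algorithm.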
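
As a minimal sketch of the loss in (1), assuming the clean targets and the reconstructions are stored as NumPy arrays of shape (frames, dimensions) with values in [0, 1], the summed cross-entropy can be computed as below. The function name and the clipping constant `eps` are illustrative choices, not part of the original method.

```python
import numpy as np

def reconstruction_cross_entropy(clean, recon, eps=1e-12):
    """Cross-entropy between clean targets and reconstructions, summed over frames and dimensions."""
    recon = np.clip(recon, eps, 1.0 - eps)  # keep both logarithms finite
    return float(-np.sum(clean * np.log(recon) + (1.0 - clean) * np.log(1.0 - recon)))

# Toy check: a perfect reconstruction costs (almost) nothing,
# while predicting 0.5 everywhere costs log(2) per entry.
clean = np.array([[0.0, 1.0], [1.0, 0.0]])
loss_perfect = reconstruction_cross_entropy(clean, clean.copy())
loss_uniform = reconstruction_cross_entropy(clean, np.full_like(clean, 0.5))
```

In the denoising setting, `recon` would be the decoder output computed from the noisy frame, while `clean` is the matching clean frame.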

When we try to pre-train the $L$-th module with $L > 1$ (i.e.
the module is not the lowest one), we should first construct
its input layer $\mathbf{x}^{(L-1)}$ by transferring $\mathbf{x}^{(0)}$
through the pre-trained networks as follows

$$\mathbf{x}^{(L-1)} = f_{\theta^{(L-1)}}\left(\cdots f_{\theta^{(l)}}\left(\cdots f_{\theta^{(1)}}\left(\mathbf{x}^{(0)}\right)\right)\right) \tag{2}$$

where $l$ denotes the $l$-th hidden layer (i.e. the $l$-th layer-wise
module from the bottom up), and $\mathbf{x}^{(0)}$ is the original
feature vector.
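
The propagation in equation (2) can be sketched as a loop over the already pre-trained encoders. This is only an illustration: the weights below are random placeholders rather than pre-trained values, the layer sizes reuse the 273-dimensional input and the first two hidden sizes from the experiments, and the logistic activation is the one adopted later in this section.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def propagate(x0, layers):
    """Apply the encoders f^(1), f^(2), ... in order to the original features x0."""
    x = x0
    for W, b in layers:              # W has shape (n_out, n_in)
        x = sigmoid(x @ W.T + b)
    return x

rng = np.random.default_rng(0)
sizes = [273, 54, 7]                 # input dimension, then two hidden layers
layers = [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x0 = rng.random((5, 273))            # five placeholder frames with features in [0, 1]
x2 = propagate(x0, layers)           # input layer for the third module
```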



Algorithm 1 Denoising-DNN-based VAD.

Input: Feature set $\{\mathbf{x}_i^{(0)}, \hat{\mathbf{x}}_i^{(0)}, y_i\}_{i=1}^{n}$, the depth of the DDNN $L$
Output: Feature extraction model $\{\theta^{(l)}\}_{l=1}^{L}$, and the linear
   classifier above the model.
 1: /* Unsupervised denoising layer-wise pre-training */
 2: for $l = 1, \ldots, L$ do
 3:    Get $\theta^{(l)}$ by solving $J_{\theta^{(l)},\theta'^{(l)}}(\mathbf{x}^{(l-1)};\hat{\mathbf{x}}^{(l-1)})$ defined
       in equation (1)
 4:    Calculate $\mathbf{x}^{(l-1)}$ by equation (2)
 5:    if $l > 1$ then
 6:       Get $\hat{\theta}^{(l-1)}$ by solving $J_{\hat{\theta}^{(l-1)},\hat{\theta}'^{(l-1)}}(\hat{\mathbf{x}}^{(l-2)};\hat{\mathbf{x}}^{(l-2)})$
          or by the contrastive divergence learning [19].
 7:       Calculate $\hat{\mathbf{x}}^{(l-1)}$ by equation (3)
 8:    end if
 9: end for
10: /* Supervised fine-tuning */
11: Construct the classification-DDNN and fine-tune it by the
    back-propagation algorithm for the minimum classification
    error mentioned in Section 2.2.



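Here comes the question: what should $\mathbf{x}^{(L-1)}$ reconstruct?
We propose to pre-train a clean-speech-to-clean-speech deep network
that accompanies the noisy-signal-to-clean-signal deep network,
so that we can get $\hat{\mathbf{x}}^{(L-1)}$ by

$$\hat{\mathbf{x}}^{(L-1)} = f_{\hat{\theta}^{(L-1)}}\left(\cdots f_{\hat{\theta}^{(l)}}\left(\cdots f_{\hat{\theta}^{(1)}}\left(\hat{\mathbf{x}}^{(0)}\right)\right)\right) \tag{3}$$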

There are two ways to pre-train the accompanying deep
network $\{f_{\hat{\theta}^{(l)}}\}_{l=1}^{L-1}$ (i.e. the deep neural network
for the clean-speech-to-clean-speech reconstruction) in the
layer-wise greedy training mode. The first one is to minimize the
reconstruction cross-entropy loss via (1) with $\hat{\mathbf{x}}$ as both
the input and the target of the module. Another way is to maximize
the logarithmic likelihood of $\hat{\mathbf{x}}$ by the efficient
contrastive divergence algorithm proposed in DBN [19]. In this paper,
we adopt the former for simplicity. Note that we cannot use
$\mathbf{x}^{(L-1)}$ to recover $\hat{\mathbf{x}}^{(0)}$ directly in order
to save the computation load of constructing the accompanying
network, since it is unlikely that the extraction network
$\{f_{\theta^{(l)}}\}_{l=1}^{L-1}$ of the noisy speech can be described simply
by a single hidden-layer reconstruction network $g_{\theta'^{(L)}}$.

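In this paper, all activation functions $f_{\theta^{(l)}}(\mathbf{x}^{(l-1)})$
and $g_{\theta'^{(l)}}(\mathbf{x}^{(l)})$ are defined as
$f_{\theta^{(l)}}(\mathbf{x}^{(l-1)}) = s(\mathbf{W}^{(l)}\mathbf{x}^{(l-1)} + \mathbf{b}^{(l)})$
and $g_{\theta'^{(l)}}(\mathbf{x}^{(l)}) = s(\mathbf{W}'^{(l)}\mathbf{x}^{(l)} + \mathbf{b}'^{(l)})$
respectively, where $\mathbf{x}^{(l)} = f_{\theta^{(l)}}(\mathbf{x}^{(l-1)})$ is the
hidden representation, the function $s(x)$ is set to the logistic
function $s(x) = 1/(1 + e^{-x})$, and $\{\mathbf{W}^{(l)}, \mathbf{b}^{(l)}\}$
denote the weight matrix and the bias term between the $(l-1)$-th
and $l$-th layers of the network respectively.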
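
The pre-training described in this section can be sketched as follows. This is an illustrative reading rather than the authors' implementation: it uses full-batch gradient descent instead of mini-batches, trains each accompanying clean-to-clean module right after the matching noisy-to-clean module, and runs on synthetic data. The names `pretrain_module` and `pretrain_ddnn` and all hyperparameter values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def pretrain_module(inputs, targets, n_hidden, lr=0.5, epochs=200, seed=0):
    """Fit an encoder (W, b) and a decoder (W2, b2) so that decoding the
    encoded inputs approximates the targets under the cross-entropy loss."""
    rng = np.random.default_rng(seed)
    n, d = inputs.shape
    W, b = 0.1 * rng.standard_normal((n_hidden, d)), np.zeros(n_hidden)
    W2, b2 = 0.1 * rng.standard_normal((d, n_hidden)), np.zeros(d)
    losses = []
    for _ in range(epochs):
        h = sigmoid(inputs @ W.T + b)        # encoder output f(x)
        z = sigmoid(h @ W2.T + b2)           # reconstruction g(f(x))
        zc = np.clip(z, 1e-12, 1 - 1e-12)
        losses.append(float(-np.sum(targets * np.log(zc)
                                    + (1 - targets) * np.log(1 - zc)) / n))
        dz = (z - targets) / n               # gradient at the decoder pre-activation
        dh = (dz @ W2) * h * (1 - h)         # back-propagated to the encoder pre-activation
        W2 -= lr * dz.T @ h
        b2 -= lr * dz.sum(axis=0)
        W -= lr * dh.T @ inputs
        b -= lr * dh.sum(axis=0)
    return (W, b), (W2, b2), losses

def pretrain_ddnn(noisy, clean, hidden_sizes, **kw):
    """Greedy layer-wise loop: noisy-to-clean modules plus an accompanying clean-to-clean network."""
    encoders, x, x_clean = [], noisy, clean
    for i, n_hidden in enumerate(hidden_sizes):
        (W, b), _, _ = pretrain_module(x, x_clean, n_hidden, **kw)   # target is the clean side
        encoders.append((W, b))
        if i < len(hidden_sizes) - 1:        # the clean network needs one layer fewer
            (Wc, bc), _, _ = pretrain_module(x_clean, x_clean, n_hidden, **kw)
            x_clean = sigmoid(x_clean @ Wc.T + bc)   # as in equation (3)
            x = sigmoid(x @ W.T + b)                 # as in equation (2)
    return encoders

# Synthetic stand-in for parallel clean/noisy features in [0, 1].
rng = np.random.default_rng(0)
proto = np.array([[0.9] * 4 + [0.1] * 4, [0.1] * 4 + [0.9] * 4])
clean = proto[rng.integers(0, 2, size=64)]
noisy = np.clip(clean + 0.1 * rng.standard_normal(clean.shape), 0.0, 1.0)
_, _, losses = pretrain_module(noisy, clean, n_hidden=4)
encoders = pretrain_ddnn(noisy, clean, hidden_sizes=[4, 3])
```

Only the noisy-side encoders are returned, since the decoders and the accompanying clean network are discarded before fine-tuning.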

2.2. Supervised Fine-tuning 

The supervised fine-tuning phase can be divided into three
steps. The first step is to construct the feature extraction
part of the DDNN by first discarding the functions
$\{g_{\theta'^{(l)}}\}_{l=1}^{L}$ and the accompanying deep networks
$\{f_{\hat{\theta}^{(l)}}, g_{\hat{\theta}'^{(l)}}\}_{l=1}^{L-1}$, and then
stacking all pre-trained functions $\{f_{\theta^{(l)}}\}_{l=1}^{L}$ layer by
layer as [18] did. The second step is to add a linear classifier above
the feature extraction part so as to form the entire DDNN.
The third step is to fine-tune the DDNN by the common
back-propagation algorithm for the minimum classification error
(MCE), where the cross-entropy loss is also used as the surrogate
relaxation function. We call the DDNN for MCE the
classification-DDNN. Note that another usage of the DDNN is to
only carry out the first step of the classification-DDNN, and
then take the extracted denoising features as the input of some
independent classifier, such as an SVM. We call the DDNN for
extracting denoising features the reconstruction-DDNN.
We only consider the classification-DDNN in this paper.
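
A hedged sketch of the classification-DDNN forward pass is given below. It assumes that the linear classifier is followed by a logistic output so that the cross-entropy surrogate applies; the paper does not spell out this detail, and the weights here are random placeholders rather than pre-trained parameters.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def classify(x, encoders, w_out, b_out):
    """Stack the encoders and apply a linear classifier with a logistic output."""
    h = x
    for W, b in encoders:
        h = sigmoid(h @ W.T + b)
    return sigmoid(h @ w_out + b_out)   # estimated probability of the speech hypothesis

def classification_loss(p, y, eps=1e-12):
    """Mean cross-entropy between predicted speech probabilities and 0/1 labels."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.default_rng(0)
sizes = [273, 54, 7, 7]                 # input dimension and hidden sizes from the experiments
encoders = [(0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]
w_out, b_out = 0.1 * rng.standard_normal(sizes[-1]), 0.0
x = rng.random((6, 273))                # six placeholder frames
y = np.array([1, 0, 1, 1, 0, 0])        # 1 = speech, 0 = noise
p = classify(x, encoders, w_out, b_out)
loss = classification_loss(p, y)
```

Fine-tuning would back-propagate the gradient of `loss` through the classifier and all stacked encoders.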

3. MOTIVATION AND RELATED WORK 

The proposed algorithm can be viewed as a combination of the
ideas of the stacked denoising autoencoder (SDAE) [20, 21]
and speech enhancement techniques [22]. SDAE, proposed
by Vincent et al. in 2008 [20, 21], is a novel deep learn- 
ing technique that has shown comparable performance with 
DBN. It first adds noise to the original clean features and then 
takes the noisy features as the input of the module that is to 
be pre-trained. But it does not try to reconstruct the noisy 
features. Instead, it tries to recover the original clean features 
by minimizing the cross-entropy loss or the squared error loss 
between the reconstructed features and the original clean fea- 
tures. Compared with SDAE, DDNN also tries to recover the 
clean features, but the noise injected to the clean features is 
from the real environment instead of from artificial addition. 

Speech enhancement techniques, such as the minimum 
mean square error estimation [22], try to estimate the ampli- 
tude of the clean speech from the noisy speech observation, 
which is also known as the a priori signal-to-noise ratio 
(SNR) estimation. The speech enhancement techniques have 
been widely employed in the VAD research, such as the 
well-known Sohn VAD [23]. Compared with the speech enhancement
techniques, we construct a deep architecture from a
machine-learning perspective for the clean speech estimation,
under the assumption that the training data has a corresponding
clean speech target, while some speech-enhancement-based
VADs assume that the background noise is relatively
stationary, so that they can trust the statistical parameters
updated in the silence periods for the clean speech estimation
when speech activity appears. We have to note that
many speech enhancement techniques, such as [24], do not need
silence periods for the noise spectrum estimation.

4. EXPERIMENTS 

Seven noisy test corpora of AURORA2 [25] are used for
performance analysis. Four signal-to-noise ratio (SNR) levels of
the audio signals are selected, which are [-5, 0, 5, 10] dB
respectively. Each test corpus of AURORA2 contains 1001
utterances, which are split randomly into three groups for
training, development and test respectively. Each training set and
each development set consists of 300 utterances. Each
test set consists of 401 utterances. Note that the corpora in
the same background noise scenario but at different SNR levels
are split with the same random seed, and have the same
manual labels. We concatenate all short utterances in each
data set into a long one so as to simulate the real-world
application environment of VAD. Eventually, the length of each long
utterance is between 450 s and 750 s, with the percentage
of speech ranging from 54.57% to 73.32%.
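
The split described above can be sketched as follows, assuming utterances are identified by integer indices. The function name, the choice of Python's `random` module, and the seed value are illustrative; the paper only states that the same random seed is reused across SNR levels of one noise scenario.

```python
import random

def split_utterances(utterance_ids, seed, n_train=300, n_dev=300):
    """Shuffle with a fixed seed and cut into training, development and test groups."""
    ids = list(utterance_ids)
    random.Random(seed).shuffle(ids)
    return ids[:n_train], ids[n_train:n_train + n_dev], ids[n_train + n_dev:]

# Reusing the seed gives the same partition for every SNR level of a noise type.
train_a, dev_a, test_a = split_utterances(range(1001), seed=0)
train_b, dev_b, test_b = split_utterances(range(1001), seed=0)
```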

The sampling rate is 8 kHz. We set the frame length to
25 ms with a frame shift of 10 ms. We extract 10 acoustic
features from each observation. The detailed information of
the features is listed in Table 1. All features are normalized
into the range of [0, 1] in each dimension.
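
As a rough illustration of these settings, the sketch below cuts a signal into 25 ms frames with a 10 ms shift at 8 kHz and rescales a feature matrix to [0, 1] per dimension. Min-max scaling is an assumption here, since the paper does not name the normalization method, and the helper names are hypothetical.

```python
import numpy as np

def frame_signal(signal, sample_rate=8000, frame_ms=25, shift_ms=10):
    """Return overlapping frames as rows of a 2-D array."""
    frame_len = sample_rate * frame_ms // 1000   # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000       # 80 samples at 8 kHz
    n_frames = 1 + (len(signal) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]

def minmax_normalize(features):
    """Scale every feature dimension (column) into [0, 1]."""
    lo, hi = features.min(axis=0), features.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)      # guard against constant columns
    return (features - lo) / scale

signal = np.random.default_rng(0).standard_normal(8000)   # one second of placeholder audio
frames = frame_signal(signal)
normalized = minmax_normalize(frames)
```

In practice the minimum and maximum would be estimated on the training set and reused for the development and test sets.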

Table 1. Features and their attributes. The subscript of each
feature is the window length of the feature [26].

ID   Feature      Dimension
1    Pitch        1
2    DFT          16
3    DFT_8        16
4    DFT_16       16
5    MFCC         20
6    MFCC_8       20
7    MFCC_16      20
8    LPC          12
9    RASTA-PLP    17
10   AMS          135
     Total        273









The SVM-based VAD, MK-SVM-based VAD, and DBN-based
VAD are used for comparison. For the SVM-based
VAD, DBN-based VAD, and DDNN-based VAD, we concatenate
all 10 features serially into a long feature vector and take
the long feature vector as the input of the classifiers. For the
MK-SVM-based VAD, we deal with the features in a way
similar to [27].
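
The serial concatenation can be sketched as below, using the feature dimensions of Table 1. The dictionary keys, including the subscripted names, are illustrative labels, and the zero arrays stand in for real per-frame features.

```python
import numpy as np

# Feature dimensions from Table 1, in table order.
FEATURE_DIMS = {
    "Pitch": 1, "DFT": 16, "DFT_8": 16, "DFT_16": 16, "MFCC": 20,
    "MFCC_8": 20, "MFCC_16": 20, "LPC": 12, "RASTA-PLP": 17, "AMS": 135,
}

def concatenate_features(blocks):
    """Join per-feature arrays of shape (n_frames, dim) into one long vector per frame."""
    return np.concatenate([blocks[name] for name in FEATURE_DIMS], axis=1)

n_frames = 3
blocks = {name: np.zeros((n_frames, dim)) for name, dim in FEATURE_DIMS.items()}
long_vectors = concatenate_features(blocks)
total_dim = sum(FEATURE_DIMS.values())
```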

With respect to the parameter settings, the Gaussian RBF kernel
is used for the SVM-based and MK-SVM-based VADs, and
the parameters of SVM and MK-SVM are selected by grid search.
For the DBN-based and DDNN-based VADs, up to three hidden
layers are adopted with the numbers of hidden units
set to [54, 7, 7] respectively. The learning rate of the
unsupervised pre-training is set to 0.004. The maximum number of
epochs of the unsupervised pre-training is set to 200. The learning
rate of the supervised fine-tuning is set to 0.005. The maximum
number of epochs of the supervised fine-tuning is set to 130.
Mini-batch training is adopted, and each batch contains 512
observations. Note that the parameters are selected empirically
as a compromise between the training time complexity and
the accuracy. We run all experiments 10 times and report the
average performances. The reported performance might be
further improved by tuning the parameters.
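
For reference, the settings above can be collected in one place together with a simple batch iterator. The dictionary keys and the per-epoch shuffling are assumptions made for this sketch; the paper only specifies the values themselves.

```python
import numpy as np

CONFIG = {
    "hidden_units": [54, 7, 7],
    "pretrain_learning_rate": 0.004,
    "pretrain_max_epochs": 200,
    "finetune_learning_rate": 0.005,
    "finetune_max_epochs": 130,
    "batch_size": 512,
}

def minibatches(n_frames, batch_size, rng):
    """Yield index arrays that cover all frames once, in shuffled order."""
    order = rng.permutation(n_frames)
    for start in range(0, n_frames, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
batch_sizes = [len(batch) for batch in minibatches(1300, CONFIG["batch_size"], rng)]
```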

Tables 2 and 3 list the experimental results. The highlighted
contents of each column are the best performance
of the referenced DBN-based VAD and that of the DDNN-based
VAD in the corresponding noise scenario respectively.
From the two tables, we can see that the deep layers of the
DDNN-based VAD perform better than the shallower layers,
which supports our conjecture in Section 3. Also, the
DDNN-based VAD outperforms the SVM-based VAD and
the MK-SVM-based VAD. Moreover, the DDNN-based VAD
even outperforms the DBN-based VAD in several noise
scenarios, which demonstrates its effectiveness. The experimental
phenomenon supports our conjecture in the introduction
about why the deep layers of the DBN-based
VAD do not outperform the shallow layers. That is, the
manifolds of the clean speech and the background noise are mixed
with each other, so that we cannot expect the DBN to distinguish
the background noise from the noisy speech, which contains the
manifolds of both the clean speech and the background noise.

Table 2. Accuracy comparison in the babble, car, restaurant, and street noises. The subscripts of the DBN and DDNN are
the depths (i.e. the numbers of the hidden layers) of the deep neural networks.

          Babble                      Car                         Restaurant                  Street
Method    -5dB   0dB    5dB    10dB   -5dB   0dB    5dB    10dB   -5dB   0dB    5dB    10dB   -5dB   0dB    5dB    10dB
SVM       54.61  64.46  75.97  79.53  72.20  81.59  86.34  87.60  69.04  74.22  82.09  84.83  58.32  67.98  74.88  78.12
MKSVM     55.43  65.02  76.17  80.18  75.01  83.50  86.38  87.94  70.44  75.71  83.25  86.30  63.38  73.35  77.60  79.10
DBN_1     61.03  69.01  78.81  80.99  77.24  84.10  87.18  88.48  70.23  75.73  83.43  86.12  66.63  73.15  78.47  80.42
DBN_2     60.81  69.24  78.94  81.23  77.88  84.14  87.04  88.44  70.10  75.68  83.59  86.08  67.41  73.76  78.70  80.86
DBN_3     60.55  69.38  79.03  80.78  77.75  83.97  87.00  88.14  69.75  75.57  83.54  85.92  67.33  72.83  79.03  80.49
DDNN_1    60.69  69.42  78.61  81.39  76.06  83.86  86.77  88.17  69.76  75.88  83.47  86.41  66.21  72.21  79.33  81.24
DDNN_2    58.62  69.07  78.85  81.62  76.80  84.04  86.96  88.54  69.71  76.05  83.90  86.62  65.51  72.72  79.17  81.53
DDNN_3    57.84  69.61  79.14  81.65  76.82  84.22  87.09  88.67  69.55  76.04  83.78  86.65  65.89  72.82  79.47  81.71

Table 3. Accuracy comparison in the airport, train, and subway noises. "AVR" is short for average. "ALL" denotes that the
AVR is calculated over all noise types and SNR levels. Note that when we calculate the averages, we do not consider the results
of the babble noise at -5 dB and 0 dB, since the manifolds of the speech and the background noise are similar in that situation.

          Airport                     Train                       Subway                      AVR over diff. noise types  AVR
Method    -5dB   0dB    5dB    10dB   -5dB   0dB    5dB    10dB   -5dB   0dB    5dB    10dB   -5dB   0dB    5dB    10dB   ALL
SVM       64.48  74.26  80.94  85.21  66.24  74.29  82.91  85.28  74.75  81.24  83.58  85.18  67.51  75.60  80.96  83.68  76.93
MKSVM     65.86  75.59  82.30  85.38  68.78  76.31  83.99  85.34  79.90  84.82  86.11  87.46  70.56  78.21  82.26  84.53  78.89
DBN_1     66.18  76.63  81.89  86.63  68.59  76.95  83.65  85.72  78.54  82.70  85.60  85.79  71.24  78.21  82.72  84.88  79.26
DBN_2     66.35  76.66  81.92  86.41  68.99  76.95  83.49  85.68  79.10  83.29  85.77  86.25  71.64  78.41  82.78  84.99  79.46
DBN_3     66.62  76.38  81.85  86.50  68.89  76.14  83.56  85.62  78.95  83.26  85.81  86.01  71.55  78.03  82.83  84.78  79.30
DDNN_1    66.00  76.61  82.34  86.81  68.59  77.36  83.88  85.94  77.90  83.20  85.84  86.64  70.75  78.19  82.89  85.23  79.27
DDNN_2    66.80  76.86  82.45  86.98  69.33  77.48  84.21  86.12  78.19  83.39  85.62  86.46  71.06  78.42  83.02  85.41  79.48
DDNN_3    67.00  76.85  82.30  86.85  69.44  77.60  84.25  86.16  78.53  83.60  85.73  86.49  71.21  78.52  83.11  85.45  79.57
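
The averaging rule stated in the caption of Table 3 can be checked with a short computation on the reported accuracies of the three-hidden-layer DDNN. The per-SNR averages below exclude the babble noise at -5 dB and 0 dB. Treating "ALL" as the mean of the four per-SNR averages is an inference from the reported numbers, not something the paper states.

```python
# Accuracies (%) of the three-hidden-layer DDNN copied from Tables 2 and 3,
# ordered by SNR level: -5 dB, 0 dB, 5 dB, 10 dB.
ddnn3 = {
    "babble":     [57.84, 69.61, 79.14, 81.65],
    "car":        [76.82, 84.22, 87.09, 88.67],
    "restaurant": [69.55, 76.04, 83.78, 86.65],
    "street":     [65.89, 72.82, 79.47, 81.71],
    "airport":    [67.00, 76.85, 82.30, 86.85],
    "train":      [69.44, 77.60, 84.25, 86.16],
    "subway":     [78.53, 83.60, 85.73, 86.49],
}
SNRS = [-5, 0, 5, 10]
EXCLUDED = {("babble", -5), ("babble", 0)}

def snr_averages(results):
    """Average over noise types at each SNR level, skipping excluded conditions."""
    averages = []
    for k, snr in enumerate(SNRS):
        vals = [acc[k] for noise, acc in results.items() if (noise, snr) not in EXCLUDED]
        averages.append(sum(vals) / len(vals))
    return averages

avgs = snr_averages(ddnn3)            # compare with 71.21, 78.52, 83.11, 85.45 in Table 3
overall = sum(avgs) / len(avgs)       # compare with the "ALL" entry 79.57
```

Because the table entries are themselves rounded to two decimals, the recomputed values can differ from the reported averages in the last digit.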

5. CONCLUSIONS AND FUTURE WORK 

In this paper, we have proposed a
denoising-deep-neural-network-based VAD. Specifically, the DDNN training
contains two phases. The first phase is to pre-train a deep neural
network in an unsupervised denoising greedy layer-wise
mode. The second phase is to fine-tune the whole deep neural
network as usual. The denoising pre-training makes the
DDNN discover the manifold of the clean speech without
suffering severely from the disruption of the background
noise. Experimental results have shown that the deep layers
of the DDNN-based VAD are much more powerful than
the shallower layers, and moreover, the DDNN-based VAD
outperforms the DBN-based VAD in several noise scenarios.

However, to train a DDNN model, the noisy speech training
corpus needs its corresponding clean corpus, which is an
ideal situation. Therefore, how to relax this constraint is what
we will focus on in future work. Moreover, the experiments
are limited to matched environments. How to make the
DDNN-based VAD perform steadily in mismatched environments
is another key problem we want to address.

Acknowledgment: The authors would like to thank the 
anonymous referees for their valuable advice, which greatly 
improved the quality of this paper. 